The Fort Leavenworth Field Unit of the U.S. Army Research Institute for
the Behavioral and Social Sciences (ARI) conducts a research program in support
of the Combined Arms Center, which includes the Combined Arms Combat
Developments Activity (CACDA) and the Command and General Staff College (CGSC).


The Field Unit is presently involved in assisting the local combat modeling
community with the representation of human performance variables in land combat
models. In the absence of performance measurements under realistic conditions,
combat modelers often resort to the use of panels of military experts to
provide estimates of how performance would be affected by various situational
factors. The present investigation explored the validity of such judgments
by asking for estimates of performance on military tasks in situations where
data from controlled field exercises exist. This investigation is responsive
to the objectives of Army Project 2Q162717A790, concerned with the improvement
of command and control procedures and systems.


Objective: 


To determine the feasibility of supplementing human performance data used
in land combat models with estimates of soldier performance in adverse
environments.


Procedure: 

Estimates of specific performance were compared with actual performance
data from previous studies. Three comparison studies were selected: (a)
ENDURE, where tank crews performed simulated combat tasks over a 48-hour
period, (b) a laboratory investigation in which fire direction center (FDC)
teams underwent up to 48 hours of simulated sustained combat operations, and
(c) Early Call I and Early Call II, where parachute platoons performed a
sustained tactical defensive exercise in Great Britain for up to five days
without sleep. Detailed descriptions of the performance tests along with
average scores or times for the first time period were given to students
from the Combined Arms and Services Staff School (CAS3). The CAS3 students
estimated the scores for the second, third, and fourth periods.


Principal  Findings: 

1. The officers agreed strongly among themselves in their predictions
of performance. This was shown by extremely high intraclass correlations.

2. The officers' predictions of performance did not reflect actual
performance measures obtained in the original field exercises. The expert
raters' predictions of performance were significantly different from the
actual performance measures.

3. Expert raters' estimates were no more accurate for performance after
12 hours than after 24, 36, or 48 hours of continuous operations.

4. Expert raters' predictions of performance were more accurate for
cognitive and vigilance tasks than for simple motor tasks.
Military Experts' Estimates
of  Continuous  Operations  Performance 
(or  Close  but  no  Cigar) 


Karen  L.  Neff 
and 

Robert E. Solick


INTRODUCTION 

Attempts to calibrate or to assess the validity of expert judgments have
led to little conclusive evidence for experts' abilities to make predictions of
random events, except for meteorologists, who are surprisingly good at making
familiar, commonplace types of forecasts (Lichtenstein, Fischhoff, and
Phillips, 1977; Murphy and Winkler, 1977). Predictions for securities and the
stock market generally are little better than a simple random model or a
no-information strategy (Borcherding, 1978). Estimates regarding other events
and other types of estimates generally lie somewhere between these two extremes.

In general, most types of estimates and predictions have been shown to be
subject to numerous types of biases on the part of the expert which play havoc
with his estimates (Bowonder, 1981; Einhorn and Hogarth, 1981; Lichtenstein,
Fischhoff, and Phillips, 1977; Slovic, Fischhoff, and Lichtenstein, 1979a;
Tversky, 1969; Tversky and Kahneman, 1974). There is evidence that these
biases can be reduced through training, particularly when there are short,
clear links between the action and its outcome (Einhorn and Hogarth, 1978;
Hogarth and Makridakis, 1981); the excellent predictions made by weather
forecasters are evidence for such an effect. There is also evidence that the
context and form of the instigation for the prediction can have a significant
effect upon the decision made (Kahneman and Tversky, 1981, 1982a, 1982b).

It is fairly common practice within the military to use experts' estimates
as shortcuts in decision making or as predictions of later performance (Uhlaner
and Drucker, 1980). However, few attempts have been made to determine the
accuracy of the estimates or decisions. Harman and Press (1975) provide
guidelines for collecting and analyzing judgments from groups of experts. They
also provide recommendations for selecting a panel of experts. They note that
the ideal method to assess the validity of predictions is "to compare them with
actual outcomes" (p. 10). However, there are difficulties in implementing this
approach, particularly when one is attempting to make forecasts in the first
place. Harman and Press (1975) recommend the use of a pilot study to establish
the validity of the predictions wherever possible.

Ryan-Jones (1979) did attempt to evaluate the validity of military
experts' judgments. His was the only such attempt available. He compared
opinions of squad leaders and platoon leaders regarding task difficulty
against the percentage of soldiers failing a criterion-referenced test on the
same tasks. He found a non-significant correlation between the expert
ratings and the independent measure of task difficulty. There was an apparent,
but not statistically significant, trend toward rating difficult tasks
as easy.


The  purpose  of  the  research  described  here  was  to  provide  additional 
evidence  regarding  the  accuracy  of  military  experts'  predictions.  We  hoped 
to  determine  the  feasibility  of  supplementing  human  performance  data  used  in 
land  combat  models  with  estimates  of  performance  in  adverse  environments  when 
no  hard  data  is  available. 


Background 

It is widely recognized that the performance of soldiers is a prime
determinant of the effectiveness of weapons, units, and forces in battle.
Yet, land combat models only recently have come to reflect human performance
variations and limitations. Factors relating to human performance in combat
include the state of the soldier (training, morale, fatigue, fear), the state
of the environment (precipitation, temperature, visibility) and the quality
of command and control (as reflected in planning, decision making,
intelligence gathering and communicating, as well as more charismatic aspects
of leadership). These variables must be shown to influence battle outcomes
through their relations to traditional model constructs such as the
probability of a hit, the vulnerability of a unit or system, the likelihood that
a system will be in good repair (thus able to participate in the battle), the
probability that orders will be received, and the time required for functions
like movement and construction of defenses. In almost all cases, the
relationships between the human factors and these model constructs are not
known quantitatively, although the general direction and probable magnitude
of the relationships can sometimes be deduced.

For  many  human  variables  of  interest,  such  as  performance  under  extreme 
stress,  it  is  virtually  impossible  to  gather  data  on  tactical  performance 
under realistic conditions. For others, such as fatigue or level of
training, it is feasible, but expensive and time consuming, to gather this data.
Since  the  construction  of  crew  or  operator  models  to  relate  the  physiological 
and  psychological  state  of  the  soldier  to  system  performance  measures  used  in 
combat  models  requires  such  data  on  every  task,  the  model  developer  may 
resort  to  the  use  of  quantitative  estimates  of  performance  as  a  surrogate  for 
performance  data. 

The  accuracy  of  such  estimates  is  of  primary  importance.  Therefore,  it 
was  proposed  to  determine  whether  military  experts  can  make  quantitative 
estimates of sufficient accuracy for formulation of functions for
incorporation in the models.

The  initial  effort  reported  here  evaluated  the  accuracy  of  military 
experts'  performance  estimates  by  examining  the  amount  of  convergence  between 
samples  of  estimates  and  the  performance  values  obtained  in  field  exercises, 
and  by  examining  the  amount  of  agreement  among  personnel  familiar  with 
tactical  tasks  concerning  variations  in  human  performance. 


METHOD 


Numerical  estimates  were  gathered  from  a  sample  of  Army  officers,  29 
students  from  the  Combined  Arms  and  Services  Staff  School  (CAS3),  Fort 
Leavenworth. Nine had armor experience, nine had artillery experience, and
eleven had infantry experience. All had at least one year of experience at
the  company  command  level. 

Three different field exercises were selected for use as comparison data
for experts' predictions of troop performance: (1) work unit ENDURE, in
which tank crews were required to perform simulated tasks over a 48-hour
period (Ainsworth & Bishop, 1971); (2) a laboratory study in which Fire
Direction Center (FDC) teams underwent up to 48 hours of simulated, sustained
combat operations (Banderet & Stokes, 1980; Banderet, Stokes, Francesconi,
Kowal, & Naitoh, 1980); and (3) field exercises conducted in Great Britain,
Exercise Early Call I and Exercise Early Call II, where parachute platoons
performed a sustained tactical defensive exercise for nine days (Haslam,
1978, 1980, 1981, 1982; Haslam, Allnutt, Worsley, Dunn, Abraham, Few, Lubuc,
& Lawrence, 1977). These three exercises were selected because they were the
only ones available that measured sustained performance on military tasks
while attempting to hold constant other factors which might affect
performance.

Three questionnaires were developed based upon these three exercises.
Expert raters for each questionnaire type had experience in the
questionnaire's specific type of activity. Each questionnaire provided a
detailed description of the context in which performance tests were conducted.
Information regarding the test conditions, the experimental procedures, the
type of personnel who participated, and the time schedule was outlined. Each
of the three types of questionnaires contained descriptions of particular
tests administered to the troops participating in the continuous operations
exercises. The descriptions included how each test was scored or timed, and
the average score and time obtained for the first period. After each
description of a task, the participants were asked to estimate the average
score and average time obtained by the soldiers for the second, third, and
fourth time periods. Where there was a maximum score which could be obtained
for perfect performance, the maximum score was given as an additional anchor.

Questionnaires were administered in group sessions by CAS3 section
leaders. Participants were briefed on the potential of the research for
applications in modeling and doctrine. They were asked to complete the
questionnaire on their own without conferring with other persons or
documents. The participants were instructed to base their responses upon their
own knowledge, experience, and training.


RESULTS 


Figures 1 through 40 in Appendix A show averages of rater estimates of
performance and field exercise measures of performance as functions of time
for each of the tests from the armor and from the artillery continuous
operations field exercises. Figures were not included for every test from
Early Call I and Early Call II because some of the information regarding
the field exercises was sensitive; only test exercises which have appeared in
the open literature were included as figures. These figures show raters with
armor experience overestimated actual field-exercise performance data in
three cases (Figures 1, 2, and 7), underestimated performance data in eight
cases (Figures 4, 6, 10, 11, 12, 13, 14, and 15), made both overestimates and
underestimates, depending upon the time period, in four cases (Figures 3, 5,
16, and 17), and made fairly accurate predictions across all time periods in
one case (Figure 9). Figures 18 through 33 show raters with artillery
experience in general overestimated actual performance data in three cases
(Figures 18, 24, and 33), underestimated actual performance data in eight
cases (Figures 22, 23, 26, 27, 28, 29, 30, and 31), and both overestimated and
underestimated actual performance data, depending upon the successive time
period, in five cases (Figures 19, 20, 21, 25, and 32). Figures 34 through
40 show that for Early Call I and II raters consistently overestimated the
decrement in performance (underestimated actual performance data) with
successive time periods in all cases but one (Figure 40).

Inter-rater agreement for the performance ratings was estimated by the
intraclass correlation for each performance measure in each task in each
exercise (see Table 1). Nunnally (1967) contended that reliabilities of .60
or .50 will suffice for exploratory research. Table 1 shows acceptable
levels of inter-rater reliability for nearly all the tasks. Sufficient
agreement among the raters with armor experience for project ENDURE was found
for all but five of the estimates of performance measures: (a) ditch crossing
time, (b) log crossing time, (c) firing accuracy of the main gun on a
stationary target with the tank stationary, (d) firing accuracy of the main
gun on a moving target with the tank stationary, and (e) accuracy of completion
of the maintenance checklist. For the artillery tasks, sufficient agreement
was found for all of the tasks except preplanning errors that exceeded 990
mils. Very high inter-rater agreement was found among officers with infantry
experience. All predictions for Early Call I and Early Call II had
intraclass correlations of nearly .80 or greater.

Since the response scales in the field and laboratory exercises varied
from task to task, the estimate scales also varied according to task. Either
the response measure had to be treated as a separate dependent variable for
any parametric analysis of the responses, or the response scales had to be
converted to a common measurement scale. Estimates for each performance
measure were treated as separate dependent variables for the intraclass
correlation computations.
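The intraclass correlation computation can be illustrated with a minimal
sketch based on a one-way analysis of variance, the form commonly associated
with Ebel's formula. The report does not state whether the single-rater or the
averaged-rater coefficient was used, so both are returned. The function name,
the layout of the ratings (one row per time period, one column per rater), and
the example numbers are assumptions made for illustration only.

```python
def intraclass_correlations(ratings):
    """Return (single-rater, averaged-rater) intraclass correlations.

    `ratings` is a list of rows, one row per rated target (assumed here to be
    a time period), each row holding one estimate per rater.
    """
    n_targets = len(ratings)
    n_raters = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n_targets * n_raters)
    row_means = [sum(row) / n_raters for row in ratings]

    # One-way ANOVA sums of squares: between targets and within targets.
    ss_between = n_raters * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    )
    ms_between = ss_between / (n_targets - 1)
    ms_within = ss_within / (n_targets * (n_raters - 1))
    if ms_between == 0:
        raise ValueError("targets have identical mean ratings")

    single = (ms_between - ms_within) / (
        ms_between + (n_raters - 1) * ms_within
    )
    averaged = (ms_between - ms_within) / ms_between  # equals 1 - 1/F
    return single, averaged


# Hypothetical estimates: time periods (rows) by raters (columns).
agreeing = [[90, 88, 92], [80, 79, 83], [70, 72, 68]]
disagreeing = [[1, 9], [2, 10]]  # F ratio below one
```

When the F ratio (mean square between divided by mean square within) is less
than one, both coefficients come out negative, which is the situation the
double asterisks in Table 1 denote.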

Table 1

Reliability of Performance Estimates
by the Intraclass Correlation

Task                                                 Intraclass correlation(a)

Armor

  Driving exercises
    Minefield-time
    Minefield-accuracy
    Slalom-time
    Slalom-accuracy
    Ditch-time
    Ditch-accuracy
    Log-time
    Log-accuracy

  Gunnery exercises
    Main gun-stationary tank, stationary target-time
    Main gun-stationary tank, stationary target-accuracy
    Main gun-stationary tank, moving target-time
    Main gun-stationary tank, moving target-accuracy
    Caliber .50 machinegun-stationary tank, moving target-time
    Caliber .50 machinegun-stationary tank, moving target-accuracy
    Coaxial machinegun-moving tank, stationary target-accuracy

Artillery (Continued)

  Unplanned missions computation-latency
  Preplanning latency                                          .90
  Preplanning-number                                           .60
  Preplanning errors
    Greater than 990 mils                                      **
    90-990 mils                                                .89
    30-89 mils                                                 .82
    15-29 mils                                                 .87
    7-14 mils                                                  .87
  On-call mission response latency                             .97

Early Call I

  Grouping capacity                                            .86
  Marching                                                     .98
  Weapon-handling tests
    Time to fill magazine by hand                              .91
    Time to load rifle, standing                               .94
    Time to unload rifle, standing                             .93
    Time to strip rifle to firing pin                          .92
    Time to assemble rifle                                     .94
    Average score-strip and reassemble                         .81
  Vigilance shooting                                           .97
  Commander ratings of military effectiveness                  .99
  Hours to withdraw                                            .98

Early Call II

  Vigilance shooting
  Vigilance with night sight
    Percentage detected
    False alarms
    Percentage detected-teams
    False alarms-teams
  Moving target shooting
    Hits
    Shots leading
    Shots lagging
  Grouping capacity

(a) Intraclass correlations of .5 or greater are considered acceptable
levels of inter-rater reliability (Nunnally, 1967).

Intraclass correlations were computed using Ebel's formula. Negative
values result for F ratios less than one. They do not connote an inverse
relationship, but should be considered equal to zero for purposes of
interpretation; they are indicated by double asterisks.

Confidence intervals based upon the t-distribution were constructed to
determine whether the estimated values were significantly different from the
actual performance values obtained in the field exercises. Computations
treating each performance measure as a separate dependent variable would have
resulted in a prohibitively large number of tests. Therefore, to create a
common scale for all the measures, performance predictions were converted to
ratios by dividing the rater estimate by the performance score obtained in
the field exercise. Log transforms of the ratios eliminated their positive
skew (Nickerson, 1981). If the rater estimate was greater than the actual
performance score, the transformed value was positive. If the estimate was
less than the actual performance score, the transformed value was negative.
The log transform of the ratio was zero if the predicted score was equal to
the actual performance score.
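The conversion to a common scale can be sketched in a few lines. The function
name and the example values are hypothetical; the base-10 logarithm follows
the note to Table 2, and the requirement that both scores be positive is an
assumption needed for the logarithm to be defined.

```python
import math


def log_ratio(estimate, actual):
    """Base-10 log of the ratio of a rater estimate to the field score."""
    if estimate <= 0 or actual <= 0:
        raise ValueError("scores must be positive to form a log ratio")
    return math.log10(estimate / actual)


# Hypothetical scores: an overestimate, an exact match, an underestimate.
over = log_ratio(30, 10)    # positive
exact = log_ratio(10, 10)   # zero
under = log_ratio(5, 10)    # negative
```

On this scale an estimate twice the actual score and an estimate half the
actual score lie equally far from zero, which is why the transform removes the
positive skew of the raw ratios.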

Confidence intervals based upon the t-distribution were constructed for
the log transforms of the ratios of predicted to actual performance.
Confidence intervals which included zero (the log10 of 1) as a possible mean
would indicate that the predicted values were not significantly different
from the actual performance values obtained in the field exercises. As can
be seen in Table 2, none of the confidence intervals included zero.
Therefore, the expert raters' predictions were significantly different from the
actual performance measures for soldiers participating in the four field
exercises.
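The confidence interval test can be sketched as follows. This is an
illustration under stated assumptions, not the original computation: the
function names and the sample log ratios are hypothetical, the degrees of
freedom are taken as n - 1, and the critical t value is supplied by the caller
from a t table (4.604 is the two-sided 99% value for 4 degrees of freedom).

```python
import math


def mean_confidence_interval(values, t_crit):
    """Return (lower, upper) bounds of a t-based interval for the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    sem = math.sqrt(variance / n)  # standard error of the mean
    return mean - t_crit * sem, mean + t_crit * sem


def differs_from_zero(values, t_crit):
    """True when the interval excludes zero, the log10 of a ratio of 1."""
    lower, upper = mean_confidence_interval(values, t_crit)
    return not (lower <= 0 <= upper)


# Hypothetical log ratios for five task-by-period cells.
biased = [0.10, 0.12, 0.11, 0.09, 0.13]
unbiased = [-0.10, 0.10, -0.05, 0.05, 0.0]
T_CRIT_99_DF4 = 4.604
```

An interval that excludes zero indicates that, on average, the estimates were
systematically above or below the field values.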

Three performance measures that appeared on the questionnaires and were
included in the intraclass correlation computations were not considered in any
further statistical analyses. On-call mission response latency predictions
made by artillerymen were excluded because 30 was inadvertently given as the
anchor for their predictions, rather than 10, which was the correct average
response latency for on-call missions in the field exercises. Even though
given an anchor three times the correct value, the artillerymen's estimates of
that item had an intraclass correlation of .97. Two measures from Early Call I,
commander rating of military effectiveness and observer ratings of marching
performance, were excluded because their scale of measurement did not justify
conversion to ratios.

Results of ANOVAs for the four field exercises show significant main
effects for tasks for armor (F = 5.5936; df = 16, 32; p < .01), artillery
(F = 11.8140; df = 14, 28; p < .01), and Early Call I (F = 8.4367; df = 7, 21;
p < .01). Main effects for tasks were not significant for Early Call II.
Main effects for fatigue were not significant for any of the four
questionnaire types. Simple effects were explored using Tukey's HSD (honestly
significant difference) test to make pairwise comparisons between means.
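The report does not describe the layout of these ANOVAs. The sketch below
shows one layout that yields a task main effect with degrees of freedom of the
form (tasks - 1) and (tasks - 1)(periods - 1): a tasks-by-periods table with a
single value per cell, using the task-by-period interaction as the error
term. That layout, the function name, and the example data are all assumptions
made for illustration.

```python
def task_main_effect(table):
    """F ratio and degrees of freedom for the row (task) main effect.

    `table[i][j]` holds one value (e.g., a mean log ratio) for task i in
    time period j. The residual after removing task and period effects
    serves as the error term.
    """
    n_tasks = len(table)
    n_periods = len(table[0])
    grand = sum(sum(row) for row in table) / (n_tasks * n_periods)
    task_means = [sum(row) / n_periods for row in table]
    period_means = [
        sum(table[i][j] for i in range(n_tasks)) / n_tasks
        for j in range(n_periods)
    ]

    ss_tasks = n_periods * sum((m - grand) ** 2 for m in task_means)
    ss_error = sum(
        (table[i][j] - task_means[i] - period_means[j] + grand) ** 2
        for i in range(n_tasks)
        for j in range(n_periods)
    )
    df_tasks = n_tasks - 1
    df_error = (n_tasks - 1) * (n_periods - 1)
    f_ratio = (ss_tasks / df_tasks) / (ss_error / df_error)
    return f_ratio, df_tasks, df_error


# Hypothetical table: 17 tasks by 3 time periods.
example = [
    [0.01 * i + 0.001 * ((i + j) % 3) for j in range(3)] for i in range(17)
]
```

With 17 tasks and 3 periods this layout gives 16 and 32 degrees of freedom.
The pairwise Tukey comparisons additionally require the studentized range
distribution and are not sketched here.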

Table 2

Means, Standard Error of the Means and Confidence
Intervals for Log Transforms of Ratios of
Predicted to Actual Performance

                          Standard
                          Error of          t Distribution
                 Mean     the Mean    df    Confidence Interval

Armor            .0568    .0186       49    C.99   .0114 < X < .1022
Artillery                 .0200       43    C.99   .0251 < X < .0831
Early Call I     .1310    .0362       30    C.99   .0314 < X < .2306
Early Call II    .0519    .0474       25    C.99   .0802 < X < .1840

Note: The means of the distributions of ratios are zero (the log10 of 1)
if the predicted values are equal to the actual performance scores.

Statistically significant differences in the officers' ability to
predict performance for armor groups were found between the following tasks:
(a) minefield accuracy and slalom accuracy, (b) minefield accuracy and ditch
accuracy, (c) minefield accuracy and log accuracy, (d) minefield accuracy and
main gun time with stationary tank and stationary target, (e) minefield
accuracy and caliber .50 machinegun accuracy with stationary tank and moving
target, (f) log accuracy and minefield time, (g) log accuracy and slalom
time, (h) log accuracy and main gun accuracy with stationary tank and
stationary target, (i) log accuracy and main gun time with stationary tank
and moving target, (j) log accuracy and caliber .50 machinegun time with
stationary tank and moving target, (k) log accuracy and coaxial machinegun
accuracy with moving tank and stationary target, (l) caliber .50 machinegun
accuracy with stationary tank and moving target and slalom accuracy, (m)
caliber .50 machinegun time with stationary tank and moving target and main
gun time with stationary tank and stationary target, (n) caliber .50
machinegun time with stationary tank and moving target and caliber .50
machinegun accuracy with stationary tank and moving target, (o) caliber .50
machinegun accuracy with stationary tank and moving target and slalom time,
(p) caliber .50 machinegun accuracy with stationary tank and moving target and
main gun accuracy with stationary tank and stationary target, and (q) caliber
.50 machinegun accuracy with stationary tank and moving target and main gun
time with stationary tank and moving target.

Statistically significant differences in the officers' ability to predict
performance for artillery teams were found between prioritizing latency and
each of the other tasks: (a) prioritizing latency and prioritizing number,
(b) prioritizing latency and unplanned mission errors greater than 990 mils,
(c) prioritizing latency and unplanned missions computation latency, (d)
prioritizing latency and unplanned mission errors from 90 to 990 mils, (e)
prioritizing latency and unplanned mission errors from 30 to 89 mils, (f)
prioritizing latency and unplanned mission errors from 15 to 29 mils, (g)
prioritizing latency and unplanned mission errors from 7 to 14 mils, (h)
prioritizing latency and preplanning latency, (i) prioritizing latency and
preplanning number, (j) prioritizing latency and preplanning errors greater
than 990 mils, (k) prioritizing latency and preplanning errors from 90 to
990 mils, (l) prioritizing latency and preplanning errors from 30 to 89 mils,
(m) prioritizing latency and preplanning errors from 15 to 29 mils, and (n)
prioritizing latency and preplanning errors from 7 to 14 mils.

Using Tukey's HSD test, statistically significant differences in the
officers' ability to predict infantry performance in Early Call I were found
between vigilance shooting and all of the other tasks, except the weapon
handling average score: (a) vigilance shooting and grouping capacity, (b)
vigilance shooting and time to fill magazine by hand, (c) vigilance shooting
and time to load rifle, (d) vigilance shooting and time to unload rifle, (e)
vigilance shooting and time to strip rifle to firing pin, and (f) vigilance
shooting and time to assemble rifle.


DISCUSSION 

The performance predictions were examined in terms of two questions:
Did the officers agree among themselves in their predictions of performance,
and did the officers' predictions of performance reflect actual performance
measures obtained in the original field exercise? The answer to the first
question was yes; the answer to the second question was no.

Intraclass correlations revealed high inter-rater reliabilities for
most items for each questionnaire type.

Confidence intervals based upon the t-distribution showed that the expert
raters' predictions were significantly different from the actual performance
measures for soldiers participating in the four field exercises.

ANOVAs revealed no significant difference in the raters' ability to predict
performance scores as a function of length of sustained performance. The raters
were not significantly more accurate in their estimates of soldier performance
after the soldiers had undergone 12 hours of continuous operations than after
the soldiers had undergone 24 hours or 48 hours of continuous operations.

ANOVAs did reveal that the raters' predictions were significantly more
accurate for some of the tasks than for others. This was true of each
questionnaire type, except Early Call II, where no difference was found among
tasks in the accuracy of predictions.

Consistent agreement on incorrect predictions is fairly clear evidence of
systematic bias.¹ To explore the nature of this bias, a post hoc analysis was
performed on the performance estimates provided for exercise Early Call II.
Briefly, it was hypothesized that the estimation task was too difficult to
perform based upon past experience with the military tasks and that the
estimators as a group adopted a simplification strategy, basing their
predictions upon a simple, but inappropriate, qualitative model of the effect
of fatigue on performance. Four such models were explored. The degree of fit
was determined by classifying each individual's predictions for each of nine
tasks in Early Call II according to whether or not they violated any of the
assumptions of each model.

The  least  restrictive  model  assumed  that  performance  would  remain  the  same 
or  deteriorate  with  increasing  levels  of  fatigue,  but  that  it  would  not  get 
better.  This  model  fit  85  of  the  99  cases,  where  a  case  consisted  of  three 
predictions  about  a  single  task  by  one  rater.  One  person  accounted  for  8  of  the 
14  deviations  from  the  model  by  consistently  predicting  that  performance  would 
recover  in  the  last  time  interval  of  the  field  experiment. 

The  next  least  restrictive  model  assumed  strictly  decreasing  performance 
with  increasing  levels  of  fatigue.  This  model  was  consistent  with  67  of  the 
99  cases.  One  additional  person  consistently  violated  this  model  by  assuming 
no  decrease  in  performance  from  the  first  to  the  second  time  interval. 

Two more restrictive models were examined as approximations to the effects
of fatigue on vigilance tasks, since these tasks are more prone to fatigue
effects and thus more likely to be within the experience of the predictors.
The partially ordered intervals model assumed that performance would get worse
over successive time periods and that the amount by which it worsened would be
the same or greater over successive time periods. The most restrictive model
considered was the strictly ordered intervals model, which assumed that the
performance decrements due to fatigue would increase over successive time
periods.

¹ The obvious alternative explanation is that the estimators conferred among
themselves despite the instructions. Monitoring by section leaders precluded
this possibility.

The more restrictive models did not fare as well. Partially ordered
intervals fit 48 of the 99 cases. Strictly ordered intervals fit in only 30
cases. Neither model fit any individual on all tasks. Cases deviating from
these models (other than the cases previously described) tended to show a
floor effect, with estimated performance dropping rapidly, then more slowly
approaching the minimum performance obtainable on a task.
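The classification of a single case against the four qualitative models can
be sketched as follows. The function name, the dictionary keys, and the
example scores are hypothetical. The sketch assumes that a case is the
first-period anchor followed by the three predictions, that both ordered
intervals models also require a strict decline in every period, and that
measures where lower values are better (such as times) are flagged by the
caller.

```python
def classify_case(scores, higher_is_better=True):
    """Report which fatigue models a sequence of scores is consistent with.

    `scores` is the anchor for the first period followed by the predicted
    values for the later periods.
    """
    seq = scores if higher_is_better else [-s for s in scores]
    # Decrement between successive periods; positive means performance fell.
    drops = [earlier - later for earlier, later in zip(seq, seq[1:])]

    non_improving = all(d >= 0 for d in drops)
    strictly_decreasing = all(d > 0 for d in drops)
    partially_ordered = strictly_decreasing and all(
        a <= b for a, b in zip(drops, drops[1:])
    )
    strictly_ordered = strictly_decreasing and all(
        a < b for a, b in zip(drops, drops[1:])
    )
    return {
        "non_improving": non_improving,
        "strictly_decreasing": strictly_decreasing,
        "partially_ordered_intervals": partially_ordered,
        "strictly_ordered_intervals": strictly_ordered,
    }


# Hypothetical cases: accelerating decline, floor effect, late recovery.
accelerating = classify_case([100, 95, 85, 70])
floor_effect = classify_case([100, 90, 85, 83])
recovery = classify_case([100, 90, 80, 85])
slower_times = classify_case([10, 12, 15, 19], higher_is_better=False)
```

A floor-effect case satisfies the two less restrictive models but violates
both ordered intervals models, while a predicted recovery in the final period
violates all four.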

In summary, the high correlations among estimates appeared to be due to
a large proportion of the expert raters adopting similar simplification
strategies. In those cases where their simple model happened to fit the
situation, the group's average judgment was quite accurate. However, the
predictions did not appear to take into account the differing effects of
fatigue on various types of activity. The groups tended to overestimate the
effects of fatigue on simple motor skills, to underestimate the effects on
cognitive and perceptual tasks, and to ignore the effects of potentially
confounding variables, such as learning, lighting, diurnal variations, and
knowledge that the end of the task was near.

A few predictors appeared to deviate from the simplest model. One
consistently assumed that performance would recover in the last time period
(as it often did). One assumed that fatigue would have no effect on simple motor
skills and very little effect on other tasks; he was closest to the field data
on the motor skills and the least accurate predictor on the vigilance tasks.

The research reported here and earlier suggests that supposed experts in
general, excepting meteorologists in some situations, often make no better
predictions than simple random models are capable of making and that experts
often are actually poorer predictors (Borcherding, 1978; Hogarth and
Makridakis, 1981; Lichtenstein, Fischhoff, and Phillips, 1977; Murphy and
Winkler, 1977; Slovic, Fischhoff, and Lichtenstein, 1977). Perhaps random,
no-information models should be used in simulations in place of expert estimates
of doubtful or unconfirmed validity. Bootstrapping, which replaces judges with
algebraic models of their own weighting policies, has resulted in models that
perform as well as or better than the judges themselves (Slovic, Fischhoff, and
Lichtenstein, 1977). Finally, Dawes and Corrigan (1974) have demonstrated
the extreme robustness of the simple linear model, which is able to capture
most of the variance in many judgmental and decision-making situations, even
in instances considered to be inherently non-linear. This suggests the use of
a simple model where actual performance measures are not available, which is
essentially the same simplification strategy that the expert raters appeared to
adopt for this experiment.
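To make the idea of a simple linear model concrete, the sketch below fits a
straight line to one rater's estimates across successive time periods by
ordinary least squares. It is a simplified, single-predictor illustration and
not the bootstrapping procedure of the cited studies, which model a judge's
weighting of several cues. The function name and all numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical rater estimates of performance (percent of baseline)
# for four successive 12-hour periods; invented for illustration.
periods = [1, 2, 3, 4]
estimates = [90.0, 78.0, 71.0, 60.0]

slope, intercept = fit_line(periods, estimates)

# The fitted line stands in for the rater: it can be evaluated for any
# period without asking for a new judgment.
fitted = [intercept + slope * p for p in periods]
print(round(slope, 2), round(intercept, 2))
print([round(v, 1) for v in fitted])
```

A line of this kind captures the steady decline that most raters appeared to
assume. Like the raters, it cannot represent recovery in the final period or
differences between task types unless those are added as further predictors.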

CONCLUSION


Expert raters' subjective estimates of performance were obtained for tasks
with known objective measures. The purpose was to establish the validity or
lack of validity of such expert-rater estimates. While a high level of
inter-rater reliability was found, the ratings were shown to have little or no
relationship to actual performance under the described circumstances.
Therefore, one must conclude that in situations similar to those described here,
attempts to supplement human performance data with military experts' estimates
of performance in adverse environments may result in inaccurate and possibly
misleading combat models.

When raters show a high degree of inter-rater reliability, there may be a
temptation to accept the ratings as accurate, when in fact the ratings may be
systematically biased. The results reported here demonstrate the need for
caution in accepting expert ratings, even in cases of high inter-rater
reliability.
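The distinction between agreement and accuracy can be sketched numerically.
The example below uses invented ratings, not data from this study, and it
assumes two simple indices: the mean pairwise Pearson correlation among raters
as a stand-in for inter-rater reliability, and the correlation between the mean
rating and the field measurements as a stand-in for validity.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data for four successive periods. All three raters assume a
# steady decline; the measured performance dips and then recovers.
ratings = {
    "rater_1": [90, 80, 70, 60],
    "rater_2": [88, 79, 69, 58],
    "rater_3": [92, 81, 72, 63],
}
field_data = [70, 62, 60, 68]

# Agreement among raters: mean of all pairwise correlations.
pairwise = [pearson(a, b) for a, b in combinations(ratings.values(), 2)]
reliability = sum(pairwise) / len(pairwise)

# Accuracy of the group: correlation of the mean rating with field data.
mean_rating = [sum(col) / len(col) for col in zip(*ratings.values())]
validity = pearson(mean_rating, field_data)

print(round(reliability, 3), round(validity, 3))
```

In this constructed case the raters agree almost perfectly with one another
while their average tracks the measured performance only weakly, which is the
pattern the conclusion warns against.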

Figure 3. Mean number of hits per three rounds with main gun
for successive 12-hour periods (tanks stationary,
targets moving) (Armor).

Figure 4. Mean time to fire first round with main gun for
successive 12-hour periods (tanks stationary,
targets moving) (Armor).

Figure 5. Mean ratings of hits per 50 rounds with caliber .50
machine gun for successive 12-hour periods (tanks
stationary, targets moving) (Armor).

Figure 6. Mean time to fire caliber .50 machine gun for
successive 12-hour periods (tanks stationary,
targets moving) (Armor).

Figure 7. Mean ratings of hits per 20-round burst with coaxial
machine gun for successive 12-hour periods (tanks
moving, target stationary) (Armor).

Figure 8. Mean log obstacle accuracy scores for successive
12-hour periods (Armor).

Figure 9. Mean log obstacle crossing times for successive
12-hour periods (Armor).

Figure 10. Mean slalom accuracy scores for successive
12-hour periods (Armor).

Figure 11. Mean slalom course times for successive
12-hour periods (Armor).

Figure 16. Mean number of maintenance tasks correctly
performed for successive 12-hour periods (Armor).

Figure 17. Mean times for performance of maintenance tasks
for successive 12-hour periods (Armor).

Figure 18. Mean latency to completion of preplanned tasks
for successive 12-hour periods (FDC).


[Figure 19 plot. Legend: Laboratory Exercise; Rater Estimate (Standard).
Horizontal axis: 12-Hour Period.]

Figure 19. Mean number of preplanned missions processed for
successive 12-hour periods (FDC).

Figure 21. Mean number of errors 90-990 mils for preplanned
targets for successive 12-hour periods (FDC).

Figure 22. Mean number of errors 30 to 89 mils for preplanned
targets for successive 12-hour periods (FDC).

Figure 23. Mean number of errors 15-29 mils for preplanned
targets for successive 12-hour periods (FDC).

Figure 26. Mean number of errors 90-990 mils for unplanned
missions or subsequent adjustments for successive
12-hour periods (FDC).

Figure 27. Mean number of errors 30-89 mils for unplanned
missions or subsequent adjustments for successive
12-hour periods (FDC).

Figure 32. Mean number of prioritizing demands satisfied
(maximum possible 44) for successive 12-hour periods
(includes completion of precomputations of firing
data for prioritized targets and designation of the
target as priority) (FDC).

Figure 33. Mean prioritizing latency for successive 12-hour
periods (FDC).

Figure 34. Mean number of hits out of nine rounds fired in
vigilance shooting task for successive days (Early Call I).

Figure 35. Mean size of best group of five rounds for
successive days (Early Call ?).

Figure 39. Mean group size, measured to the nearest quarter
inch, for successive days (Early Call II).


